1. INTRODUCTION

Cancer is the second leading cause of death globally [1]. Detecting the disease at an early stage is
associated with markedly improved survival prospects [2], [3], and early-stage cancer is more likely to be
treatable [4]. Colorectal cancer has the third-highest cancer death rate and is responsible for around
600,000 deaths per year worldwide [5]-[8]. Information technology has an important role in the field of
medicine, and cancer is one of the diseases that can be detected with machine learning. Data are very useful
in the medical field, as shown by the rapid development of data mining in medical science. This development
is reflected in high prediction performance, reduced treatment costs, better chances of patient recovery,
and support for life-saving decisions [9], [10].

Machine learning is an application of artificial intelligence that gives systems the ability to learn and
improve automatically from experience without being explicitly programmed [11]. One method that is popular
because of its very good learning performance is the twin support vector machine (twin SVM) [12]. The kernel
method uses kernel functions so that an algorithm can operate in a feature space of higher dimension. It
relies on inner products between the images of all pairs of data points in the feature space. This method is
used directly or indirectly by SVM and twin SVM to classify data [13]. The kernel functions commonly used
with SVM methods are the linear, polynomial, RBF, and Gaussian kernels. This paper proposes the twin SVM
method as a novel approach for the early detection of colorectal cancer, using these four kernel functions.
It compares the performance of the twin SVM with each kernel to find the best kernel for the detection of
colorectal cancer.

2. RESEARCH METHOD
2.1. Twin support vector machine 

A standard SVM finds a single hyperplane to classify samples. The twin SVM proposed in [14] instead finds
two hyperplanes, and each sample is assigned to a class according to its distance from the two hyperplanes.
The equations of the two hyperplanes are:

$$w_1^T x + b_1 = 0$$
$$w_2^T x + b_2 = 0$$


Here $w_i$ and $b_i$ are the parameters of the $i$-th hyperplane. The two hyperplanes are non-parallel, and
each one is closest to the samples of its own class and farthest from the samples of the opposite class.
Assume a binary classification task with classes +1 and -1, and let $A \in \mathbb{R}^{m_1 \times n}$ and
$B \in \mathbb{R}^{m_2 \times n}$ be the matrices containing the samples of class +1 and class -1,
respectively [15]. Each row of a matrix is one sample of the corresponding class. The two hyperplanes of the
twin SVM are obtained from (1) and (2):


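$$
\begin{aligned}
\min_{w_1, b_1, \xi} \quad & \tfrac{1}{2} (A w_1 + e_1 b_1)^T (A w_1 + e_1 b_1) + p_1 e_2^T \xi \\
\text{s.t.} \quad & -(B w_1 + e_2 b_1) + \xi \ge e_2, \quad \xi \ge 0
\end{aligned}
\tag{1}
$$

$$
\begin{aligned}
\min_{w_2, b_2, \xi} \quad & \tfrac{1}{2} (B w_2 + e_2 b_2)^T (B w_2 + e_2 b_2) + p_2 e_1^T \xi \\
\text{s.t.} \quad & -(A w_2 + e_1 b_2) + \xi \ge e_1, \quad \xi \ge 0
\end{aligned}
\tag{2}
$$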


Here $\xi$ is the vector of slack variables, whose components are non-negative ($\xi \ge 0$), and $e_1$ and
$e_2$ are vectors of ones of appropriate dimensions. Allowing the decision margin to make a few mistakes
(for example, some points lie inside or on the wrong side of the margin) is the standard approach when the
samples cannot be separated linearly. Each non-zero element of the slack variable vector determines the cost
of a misclassified sample, which is proportional to the distance between the sample and the decision margin.
In these equations, $p_1$ and $p_2$ are penalty parameters. Twin SVM is in great demand in various fields,
and several versions of the algorithm have been proposed [16]. Recently, several fuzzy formulations of twin
SVM have also been proposed [17].
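As an illustration of how a sample is assigned to a class once the two hyperplanes are known, the following minimal sketch computes the perpendicular distance from a point to each hyperplane and picks the nearer one. The function names and the two example hyperplanes are hypothetical and chosen only for illustration; the sketch does not solve problems (1) and (2).

```python
import math

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane w^T x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def twin_svm_predict(x, w1, b1, w2, b2):
    """Return +1 if x is nearer to hyperplane 1, otherwise -1."""
    d1 = distance_to_hyperplane(x, w1, b1)
    d2 = distance_to_hyperplane(x, w2, b2)
    return 1 if d1 <= d2 else -1

# Hypothetical, non-parallel hyperplanes in two dimensions:
# hyperplane 1 is the line x_1 = 1, hyperplane 2 is the line x_2 = -1.
w1, b1 = [1.0, 0.0], -1.0
w2, b2 = [0.0, 1.0], 1.0

label_a = twin_svm_predict([0.9, 2.0], w1, b1, w2, b2)   # close to hyperplane 1
label_b = twin_svm_predict([4.0, -1.1], w1, b1, w2, b2)  # close to hyperplane 2
```

In this toy setting the first point is assigned to class +1 and the second to class -1, because each lies much closer to one hyperplane than to the other.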


2.2. Kernel function 

The kernel method uses kernel functions so that algorithms can operate in feature spaces of higher
dimension. It relies on inner products between the images of all pairs of data points in the feature
space [18]. In high-dimensional data sets, it is difficult to assign objects to the right cluster accurately
when Euclidean distances are measured, as in k-means, c-means, or fuzzy c-medoids. The data distribution
needs to be represented well in order to validate the true cluster centers. This difficulty can be overcome
by using the kernel method [19]. Let $X^n$ be an input space, $F$ a feature space, and
$\phi : X^n \to F$ a mapping. The kernel function is defined in (3) [20], [21]:


$$K(x_1, x_2) = \phi(x_1)^T \phi(x_2) \tag{3}$$


where $x_1, x_2 \in X^n$.
Kernel functions that are often used are linear kernel, polynomial kernel, RBF kernel, and gaussian kernel. 
Table 1 lists the formulas for kernel functions [22]-[23]: 


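Table 1. The formula of kernel function

| Kernel Function | Formula |
|---|---|
| Linear Kernel | $K(x_1, x_2) = x_1^T x_2 + c$ |
| Polynomial Kernel | $K(x_1, x_2) = (\gamma x_1^T x_2 + c)^d, \ \gamma > 0$ |
| RBF Kernel | $K(x_1, x_2) = e^{-\gamma \lVert x_1 - x_2 \rVert^2}, \ \gamma > 0$ |
| Gaussian Kernel | $K(x_1, x_2) = e^{-\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2}}$ |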
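To make the four kernels concrete, the sketch below writes each of them as a plain Python function, assuming the conventional forms of the linear, polynomial, RBF, and Gaussian kernels. The function names, the parameter names (`c`, `gamma`, `d`, `sigma`), and the default values are illustrative assumptions, not settings taken from this study.

```python
import math

def dot(x1, x2):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x1, x2))

def squared_distance(x1, x2):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))

def linear_kernel(x1, x2, c=0.0):
    return dot(x1, x2) + c

def polynomial_kernel(x1, x2, gamma=1.0, c=1.0, d=2):
    return (gamma * dot(x1, x2) + c) ** d

def rbf_kernel(x1, x2, gamma=1.0):
    return math.exp(-gamma * squared_distance(x1, x2))

def gaussian_kernel(x1, x2, sigma=1.0):
    return math.exp(-squared_distance(x1, x2) / (2.0 * sigma ** 2))

# Two made-up feature vectors used only to exercise the functions
x1 = [1.0, 2.0]
x2 = [3.0, 4.0]
k_lin = linear_kernel(x1, x2)
k_poly = polynomial_kernel(x1, x2, gamma=1.0, c=1.0, d=2)
k_rbf = rbf_kernel(x1, x2, gamma=1.0)
k_gauss = gaussian_kernel(x1, x2, sigma=1.0)
```

Written this way, the RBF and Gaussian forms differ only in how the width is parameterised: setting `gamma = 1 / (2 * sigma ** 2)` makes the two functions return the same value.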





2.3. k-Fold cross validation 
The dataset is divided into two parts: training data and testing data. This is done so that a model can be
built and then evaluated. The machine learns and recognizes the patterns in the colorectal cancer data from
the training data. The testing data are used to evaluate the model obtained after the machine has learned the
data patterns [24]. In this study, the dataset is divided into training data and testing data with the
k-fold cross validation method [25]. This method divides the dataset into k parts of the same size. The model
is then trained and tested k times, with a different part taken as the validation data each time and the
remaining parts used for training.
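The splitting step can be sketched in a few lines of Python. This is a minimal illustration of the k-fold idea using only the standard library, not the routine used in this study; the function name `k_fold_indices` and the `seed` argument are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=None):
    """Yield (train_idx, test_idx) pairs for k folds over n_samples indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Small made-up example: 20 samples split into 5 folds
folds = list(k_fold_indices(n_samples=20, k=5, seed=0))
```

Each sample index appears in the test part of exactly one fold, so every sample is used once for validation and k - 1 times for training.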


2.4. Proposed method 

This study proceeds in several stages. In the first stage, the data are divided into training and testing
data with k-fold cross validation. The k-value chosen was 10, with a random state of 45. This means that the
dataset was divided into 10 subsets of the same size. In the second stage, the training data were used by the
twin SVM method with the linear, polynomial, RBF, and Gaussian kernels to learn the data patterns and build
classification models. In the next stage, the testing data are classified with the models obtained, and the
models are evaluated on accuracy and running time. To find the best kernel, the evaluation parameters
produced by each kernel are compared.


3. RESULTS AND ANALYSIS 

This research used Jupyter Notebook to run the twin SVM program with the linear, polynomial, RBF, and
Gaussian kernels. All stages in this paper were implemented in the Python 3 programming language.


3.1. Data 

In this study, the data consisted of 210 samples and seven features: age, CEA, hemoglobin, leukocytes,
hematocrit, platelets, and diagnosis. The diagnosis feature is the target feature for detecting colorectal
cancer. The data are colorectal cancer data obtained from Al-Islam Hospital, Bandung, Indonesia, labeled as
cancer (1) or no cancer (0). Table 2 shows part of the data:


Table 2. Part of colorectal cancer data

| Age | CEA | Hemoglobin | Leukocyte | Hematocrit | Platelets | Diagnosis |
|---|---|---|---|---|---|---|
| 74 | 3.26 | 11.8 | 19400 | 37.3 | 341000 | 0 |
| 84 | 29.12 | 8 | 12400 | 26.6 | 465000 | 1 |
| 81 | 4.5 | 8.8 | 19900 | 26.2 | 468000 | 0 |
| 56 | 0.96 | 13.9 | 9400 | 41.5 | 260000 | 0 |
| 75 | 3.24 | 77 | 13500 | 22.5 | 377000 | 0 |
| 58 | 0.71 | 11 | 18200 | 34 | 259000 | 0 |
| 63 | 1.65 | 10.1 | 19900 | 32.1 | 151000 | 0 |
| 73 | 36.49 | 11.1 | 9700 | 33.4 | 267000 | 1 |





3.2. Confusion matrix 

In this paper, a confusion matrix was used to assist in calculating the evaluation parameters of the
classification model. Table 3 shows the confusion matrix used to evaluate the kernel-based twin SVM
classification model for the diagnosis of colorectal cancer.


Table 3. Confusion Matrix

| | Predicted: Cancer (Y) | Predicted: Non-Cancer (N) |
|---|---|---|
| Actual: Cancer (Y) | TP | FN |
| Actual: Non-Cancer (N) | FP | TN |





Explanation:

TP (true positive): the number of colorectal cancer cases that are predicted correctly

TN (true negative): the number of non-colorectal-cancer cases that are predicted correctly

FP (false positive): the number of non-colorectal-cancer cases that are predicted wrongly (predicted as
colorectal cancer)

FN (false negative): the number of colorectal cancer cases that are predicted wrongly (predicted as not
colorectal cancer)
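The four counts can be obtained by comparing the actual and predicted labels one by one. The following minimal sketch assumes the label coding used for the data (1 for cancer, 0 for no cancer); the function name and the two short label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # cancer, predicted as cancer
        elif actual != positive and predicted != positive:
            tn += 1   # not cancer, predicted as not cancer
        elif actual != positive and predicted == positive:
            fp += 1   # not cancer, predicted as cancer
        else:
            fn += 1   # cancer, predicted as not cancer
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Made-up labels, not taken from the hospital data
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
```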


3.3. Evaluation parameters 
The parameters used to evaluate the performance of the twin SVM classification model were accuracy and
running time. Equation (4) gives the formula for accuracy:

$$\text{Accuracy} = \frac{TN + TP}{FN + TP + FP + TN} \times 100\% \tag{4}$$


Accuracy is the ratio of the number of cases identified correctly, both colorectal cancer and not
colorectal cancer, to the total number of cases.
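As a small worked example of the accuracy formula, the sketch below computes the percentage from the four confusion matrix counts. The function name and the counts passed to it are hypothetical and do not come from the experiments in this paper.

```python
def accuracy_percent(tp, tn, fp, fn):
    """Accuracy as a percentage: correctly classified cases over all cases."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return 100.0 * (tp + tn) / total

# Hypothetical counts: 85 of 100 cases are classified correctly
acc = accuracy_percent(tp=40, tn=45, fp=10, fn=5)
```

With these counts, 40 + 45 = 85 of the 100 cases are correct, so the accuracy is 85%.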


3.4. Results 

In this section, we discuss the performance evaluation of the twin SVM classification model for detecting
colorectal cancer with the linear, polynomial, RBF (radial basis function), and Gaussian kernels. All kernel
parameters were set to 1. The performance evaluation parameters compared are accuracy and running time. In
this research, the highest accuracy is obtained with the polynomial kernel. This indicates that the
polynomial kernel is the appropriate kernel for detecting colorectal cancer with a twin support vector
machine. Table 4 compares the accuracy and running time of the twin SVM classification model for each
kernel.


Table 4. Results of the twin SVM classification model based on kernel

| Classification Model | Accuracy (%) | Running Time (seconds) |
|---|---|---|
| Linear Kernel | 81 | 0.565 |
| Polynomial Kernel | 86 | 0.502 |
| RBF Kernel | 76 | 1.605 |
| Gaussian Kernel | 76 | 1.612 |





Based on Table 4, the twin SVM model recorded its highest accuracy, 86%, with the polynomial kernel, at a
running time of 0.502 seconds. The lowest accuracy, 76%, was recorded with the RBF and Gaussian kernels, with
running times of 1.605 seconds and 1.612 seconds, respectively. In terms of running time, the twin SVM model
with the polynomial kernel is the fastest compared to the linear, RBF, and Gaussian kernels, at around
0.502 s, while the model with the Gaussian kernel is the slowest, at around 1.612 s. Based on these results,
the polynomial kernel gives the best results in terms of both accuracy and running time. Thus, the
polynomial kernel is the best kernel for the twin SVM on this colorectal cancer dataset.
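The comparison amounts to timing each model and then selecting the kernel with the highest accuracy. The sketch below shows one way this could be done in Python; the helper names are hypothetical, and breaking ties by shorter running time is an assumption made here for illustration. The accuracy and running time values are the ones reported in Table 4.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def best_by_accuracy(results):
    """Pick the kernel with the highest accuracy; ties go to the shorter running time."""
    return max(results, key=lambda name: (results[name][0], -results[name][1]))

# (accuracy in %, running time in seconds) as reported in Table 4
results = {
    "linear": (81, 0.565),
    "polynomial": (86, 0.502),
    "rbf": (76, 1.605),
    "gaussian": (76, 1.612),
}
best = best_by_accuracy(results)
```

Here `timed_call` would wrap the training and prediction step of each model, and `best_by_accuracy` selects the polynomial kernel for the values in Table 4.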


4. CONCLUSION 

Detecting colorectal cancer quickly is very important, because it allows the cancer to be treated before it
spreads to other organs of the body. However, this is difficult because colorectal cancer has no specific
symptoms. The twin SVM method can help detect colorectal cancer based on blood tests and age. The most
appropriate kernel for the twin SVM method in detecting colorectal cancer is the polynomial kernel, which
produces an accuracy of 86% with a running time of 0.502 seconds.